🐛 fix(subprocess): add timeout to interpreter probing #42

Merged
gaborbernat merged 2 commits into tox-dev:main from gaborbernat:fix
Mar 6, 2026

@gaborbernat gaborbernat commented Mar 6, 2026

_run_subprocess calls process.communicate() with no timeout when probing candidate Python interpreters. On Windows CI runners, this subprocess can hang indefinitely — typically caused by Windows Store Python stubs, antivirus file locks, or pipe I/O race conditions between short-lived child processes and the parent's reader threads. 🪟 Analysis of the last 30 days of tox CI revealed 18 flaky timeout failures across 9 different tests, almost exclusively on windows-2025 runners (with 2 on macos-15).

The failures fall into two patterns that share the same deadlock chain in tox's common.py. Pattern A (discovery): the tox-driver thread hangs in _run_subprocess → process.communicate() while probing interpreters. Pattern B (install/provision): a package install or provision subprocess hangs. In both cases, thread.join() blocks the main thread so KeyboardInterrupt can never be delivered, as_completed() blocks the interrupt thread so it can't check the interrupt event, and executor.shutdown(wait=True) prevents cleanup, creating an unbreakable deadlock.
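
For illustration only, here is a minimal sketch of the general idea behind making such a wait interruptible: poll the futures with a short timeout and check an interrupt event between polls, instead of blocking without a deadline. This is not tox's code and not necessarily what the companion PR does; `wait_interruptibly`, `demo`, and the polling interval are hypothetical names and values chosen for the example.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def wait_interruptibly(futures, interrupt, poll=0.05):
    """Wait for futures, but give up as soon as `interrupt` is set.

    Hypothetical helper: returns ("done", finished) when every future
    completed, or ("interrupted", finished_so_far) otherwise.
    """
    pending = set(futures)
    finished = set()
    while pending:
        if interrupt.is_set():
            return "interrupted", finished
        # A bounded wait lets the loop re-check the interrupt event.
        done, pending = wait(pending, timeout=poll, return_when=FIRST_COMPLETED)
        finished |= done
    return "done", finished


def demo():
    release = threading.Event()    # stands in for a subprocess that never exits
    interrupt = threading.Event()  # stands in for the interrupt signal
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(release.wait)
    threading.Timer(0.1, interrupt.set).start()
    status, finished = wait_interruptibly([future], interrupt)
    release.set()  # let the stand-in finish so the demo itself can exit
    executor.shutdown(wait=True)
    return status, len(finished)
```

In the demo the worker never finishes on its own, yet the waiting loop still notices the interrupt roughly one polling interval after it is set.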

The fix adds a 5-second timeout to process.communicate(). ⏱️ The probed subprocess only reads sys attributes and prints JSON — a healthy interpreter completes this in milliseconds, so 5 seconds is generous. On timeout the hung process is killed and treated as a failed probe, allowing discovery to skip it and continue to the next candidate. This fits naturally into the existing error-handling flow since non-zero exit codes already produce a RuntimeError that callers handle gracefully. A companion PR in tox addresses the deadlock chain in common.py to handle Pattern B and make the execution engine interruptible.
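
A minimal sketch of the pattern described above, assuming a simplified probe: this is not the actual `_run_subprocess` implementation, and `probe_interpreter`, `DEFAULT_SCRIPT`, and the JSON payload are hypothetical stand-ins. Only `subprocess.Popen`, `communicate(timeout=...)`, `TimeoutExpired`, and `kill()` are standard library calls.

```python
import json
import subprocess
import sys

PROBE_TIMEOUT = 5  # seconds, the value chosen in this PR

# Hypothetical probe script: reads a sys attribute and prints JSON.
DEFAULT_SCRIPT = (
    "import json, sys; "
    "print(json.dumps({'version': list(sys.version_info[:3])}))"
)


def probe_interpreter(exe, script=DEFAULT_SCRIPT, timeout=PROBE_TIMEOUT):
    """Run `script` under `exe`; raise RuntimeError on hang or failure."""
    process = subprocess.Popen(
        [exe, "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        out, err = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        process.kill()         # stop the hung child
        process.communicate()  # reap it and drain the pipes
        raise RuntimeError(f"{exe} probe timed out after {timeout}s") from exc
    if process.returncode != 0:
        raise RuntimeError(f"{exe} probe failed: {err.strip()}")
    return json.loads(out)
```

Because a timeout surfaces as the same `RuntimeError` as a non-zero exit code, a caller that already skips failed candidates needs no extra branch for hung ones. For example, `probe_interpreter(sys.executable)` returns the version of the running interpreter, while a script that sleeps past the timeout raises.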

Also fixes a pre-existing ty type-check failure where ty 0.0.17 with --python-version 3.8 completely misresolves pytest.skip (a _WithException-wrapped function) as a bound method with no parameters — replaced with raise pytest.skip.Exception(msg) which bypasses the broken wrapper.

On Windows CI, the subprocess spawned to probe a candidate Python
interpreter can hang indefinitely — triggered by Windows Store stubs,
antivirus holds, or pipe I/O race conditions. This caused ~18 flaky
timeout failures across 9 different tests in tox over the last 30 days,
almost exclusively on windows-2025 runners.

The root cause is process.communicate() being called with no timeout
in _run_subprocess. Adding a 5s timeout and killing the process on
expiry allows discovery to skip unresponsive interpreters and continue.
@gaborbernat gaborbernat added the bug Something isn't working label Mar 6, 2026
@gaborbernat gaborbernat force-pushed the fix branch 7 times, most recently from d91ec9d to 8c22da5 March 6, 2026 23:16
ty on 3.8 reports too-many-positional-arguments for pytest.skip().
@gaborbernat gaborbernat enabled auto-merge (squash) March 6, 2026 23:49
@gaborbernat gaborbernat merged commit 30f67d2 into tox-dev:main Mar 6, 2026
13 checks passed
@gaborbernat gaborbernat deleted the fix branch March 6, 2026 23:51